Performance of Gene Name Recognition Tools on Patents
Abstract
The accurate identification of gene and protein names in patents is an essential step in many highly commercially relevant applications, such as patent retrieval, prior art search, and patent classification. Since patents exhibit a number of properties that make them quite different from scientific articles, it is questionable whether tools developed for scientific articles will work equally well on patents. Answering this question is complicated by the fact that only a few annotated patent corpora exist, which makes training hard. In this paper, we report on a comparative evaluation of four existing gene/protein named entity recognition and normalization tools, all trained on scientific articles, with regard to their performance on two patent corpora. We analyze the tools with respect to different evaluation metrics to highlight their respective strengths and limitations. Our results reveal that the performance of these tools on patents is generally lower than on scientific articles. Using one of the four tools as an example, we also show that training on annotated patents considerably improves performance on patent corpora. We conclude that more effort must be invested in producing adequate training data for working with patents.
Keywords: Patent Mining, Named Entity Recognition, Named Entity Normalization, Gene and Protein Entities, Performance Measurements.
Similar resources
Adapting ChER for the recognition of chemical mentions in patents
ChER (Chemical Entity Recogniser) is a pipeline of natural language processing tools optimised for the recognition of chemical names in scientific abstracts. It formed the basis of our submissions to the previous edition of the CHEMDNER track in BioCreative IV, and was one of the top-performing systems both for the chemical document indexing (CDI) and chemical entity mention recognition (CEM) s...
Mining Patents with tmChem, GNormPlus and an Ensemble of Open Systems
The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. The CHEMDNER task at BioCreative V focused on information extraction from patents. This manuscript describes our submissions to the CEMP (chemical named entity recognition) and GPRO (gene and related object identification) subtasks. Our CEMP submission is an ensemble of...
Neji: Recognition of Chemical and Gene Mentions in Patent Texts
The BioCreative V.5 challenge focused on the recognition of chemicals and gene mentions in medicinal chemistry patents. For participation in the chemical entity (CEMP) and gene and protein (GPRO) recognition tasks, we used the concept recognition framework Neji and applied a machine-learning strategy using an optimized feature set. Our best submissions achieved an F-score of 86.6% for the identi...
Identification of chemical and gene mentions in patent texts using feature-rich conditional random fields
This article describes the application of Neji, a text-processing and concept recognition framework, to the automatic recognition of chemicals and gene mentions in medicinal chemistry patents. We used conditional random fields models trained with an optimized set of features including linguistic, orthographic, morphological, dictionary matching and local context features, dictionary-matching, and...
Evaluation of chemical and gene/protein entity recognition systems at BioCreative V.5: the CEMP and GPRO patents tracks
This paper presents the results of the BioCreative V.5 offline tasks, which evaluated the performance of, and assessed the progress made by, strategies for the automatic recognition of mentions of chemical names and genes in the running text of medicinal chemistry patent abstracts. A total of 21 teams submitted results for at least one of these tasks. The CEMP (chemical entity mention i...
Publication date: 2016